Abstract. While in general trading off exploration and exploitation in

reinforcement learning is hard, under some formulations relatively simple
solutions exist. Optimal decision thresholds for the multi-armed bandit
problem, one for the infinite horizon discounted reward case and one for
the finite horizon undiscounted reward case, are derived, which make the
link between the reward horizon, uncertainty and the need for exploration
explicit. From this result follow two practical approximate algorithms,
which are illustrated experimentally.

1 Introduction

In reinforcement learning, the dilemma between selecting actions to maximise
the expected return according to the current world model and to improve the
world model such as to potentially be able to achieve a higher expected return is
referred to as the exploration-exploitation trade-off. This has been the subject of
much interest before, one of the earliest developments being the theory of
sequential sampling in statistics, as developed by [1]. This dealt mostly with making
sequential decisions for accepting one among a set of particular hypotheses, with
a view towards applying it to jointly decide the termination of an experiment and
the acceptance of a hypothesis. A more general overview of sequential decision
problems from a Bayesian viewpoint is offered in [2].

The optimal, but intractable, Bayesian solution for bandit problems was given
in [3], while recently tight bounds on the sample complexity of exploration have
been found [4]. An approximation to the full Bayesian case for the general
reinforcement learning problem is given in [5], while an alternative technique based
on eliminating actions which are confidently estimated as low-value is given in
[6].

The following section formulates the intuitive concept of trading exploration
and exploitation as a natural consequence of the definition of the problem of
reinforcement learning. After the problem definitions which correspond to either
extreme are identified, Sec. 3 derives a threshold for switching from exploratory
to greedy behaviour in bandit problems. This threshold is found to depend on the
effective reward horizon of the optimal policy and on our current belief
distribution of the expected rewards of each action. A sketch of the extension to MDPs
is presented in Sec. 4. Section 5 uses an upper bound on the value of exploration
to derive practical algorithms, which are then illustrated experimentally in Sec.
6. We conclude with a discussion on the relations with other methods.



2 Exploration Versus Exploitation 

Let us assume a standard multi-armed bandit setting, where a reward
distribution $p(r_{t+1} \mid a_t)$ is conditioned on actions $a_t \in A$, with $r_t \in \mathbb{R}$. The
aim is to discover a policy $\pi = \{P(a_t = i) \mid i \in A\}$ for selecting actions such
that $E[r_{t+1} \mid \pi]$ is maximised. It follows that the optimal gambler, or oracle, for
this problem would constitute a policy which always chooses $i \in A$ such that
$E[r_{t+1} \mid a_t = i] \geq E[r_{t+1} \mid a_t = j]$ for all $j \in A$. Given the conditional expectations,
implementing the oracle is trivial. However, this tells us little about the optimal
way to select actions when the expectations are unknown. As it turns out, the
optimal action selection mechanism will depend upon the problem formulation.
We initially consider the two simplest cases in order to illustrate that the
exploration/exploitation trade-off is, and should be, viewed in terms of problem and
model definition.

In the first problem formulation the objective is to discover a parameterized
probabilistic policy $\pi = \{P(a_t \mid \theta_t) \mid a_t \in A\}$, with parameters $\theta_t$, for selecting
actions such that $E[r_{t+1} \mid \pi]$ is maximised. If we consider a model whose
parameters are the set of estimates $\theta_t = \{q_i = E_t[r_{t+1} \mid a_t = i] \mid i \in A\}$, then the
optimal choice is to select the $a_t$ for which the estimated expected value of the
reward is highest, because according to our current belief any other choice will
necessarily lead to a lower expectation. Thus, stating the bandit problem in this
way does not allow the exploration of seemingly lower, but potentially higher
value actions and it results in a greedy policy.

In the second formulation, we wish to minimise the discrepancy between our
estimate $q_i$ and the true expectation. This could be written as the following
minimisation problem:

$$\sum_{i \in A} E\left[ \| r_{t+1} - q_i \|^2 \mid a_t = i \right].$$

For point estimates of the expected reward, this requires sampling uniformly
from all actions and thus represents a purely exploratory policy. If the problem
is stated as simply minimising the discrepancy asymptotically, then uniformity
is not required and it is only necessary to sample from all actions infinitely often.
This condition holds when $P(a_t = i) > 0$ $\forall i \in A$, $t > 0$ and can be satisfied
by mixing the optimal policies for the two formulations, with a probability $\epsilon$ of
using the uniform action selection and a probability $1 - \epsilon$ of using the greedy
action selection. This results in the well-known $\epsilon$-greedy policy (see for example
[7]), with the parameter $\epsilon \in [0, 1]$ used to control exploration.
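
The mixture of the two policies described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `epsilon_greedy` and the list `q` of point estimates are assumptions made for the example, not notation from the text.

```python
import random

def epsilon_greedy(q, epsilon, rng=random):
    """Pick an action index given point estimates q (hypothetical helper).

    With probability epsilon the action is drawn uniformly (pure exploration);
    otherwise the action with the highest estimate is chosen (greedy).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q))                 # uniform over all actions
    return max(range(len(q)), key=lambda i: q[i])    # greedy choice

# Example: with epsilon = 0 the choice is purely greedy.
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

Setting `epsilon = 0` recovers the greedy policy of the first formulation, while `epsilon = 1` recovers uniform sampling.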



This formulation of the exploration-exploitation problem, though leading to
an intuitive result, does not lead to an obvious way to optimally select actions. In
the following section we shall consider bandit problems for which the functional
to be maximised is

$$E\left[ \sum_{k=0}^{N} g(k)\, r_{t+k+1} \,\middle|\, \pi \right], \qquad g(k) \in [0, 1], \quad N \geq 0,$$

with $\sum_{k=0}^{\infty} g(k) < \infty$. In this formulation of the problem we are not only
interested in maximising the expected reward at the next time step, but in the
subsequent $N$ steps, with the $g(\cdot)$ function providing another convenient way
to weigh our preference among short and long-term rewards. Intuitively it is
expected that the optimal policy for this problem will differ depending
on how long-term the rewards that we are interested in are. As will be shown
later, by lengthening the effective reward horizon through manipulation of $g$
and $N$, i.e. by changing the definition of the problem that we wish to solve, the
exploration bias is increased automatically.
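
To make the role of $g$ and $N$ concrete, the following sketch evaluates the weighted return for a given sequence of rewards and the total weight $\sum_{k=0}^{N} g(k)$. Using the total weight as a crude measure of the effective reward horizon is an assumption made here for illustration, as are the function names.

```python
def weighted_return(rewards, g):
    """Compute sum_k g(k) * r_{t+k+1} for a finite list of future rewards."""
    return sum(g(k) * r for k, r in enumerate(rewards))

def total_weight(g, N):
    """Sum of the weights g(0), ..., g(N): a rough proxy for the horizon."""
    return sum(g(k) for k in range(N + 1))

# Exponential discounting with gamma = 0.5 over three unit rewards.
discounted = weighted_return([1.0, 1.0, 1.0], lambda k: 0.5 ** k)

# Undiscounted finite horizon: ten steps, all weighted equally.
horizon = total_weight(lambda k: 1.0, 9)
```

For $g(k) = \gamma^k$ and $N \to \infty$ the total weight is $1/(1-\gamma)$, so moving $\gamma$ towards 1 lengthens the horizon in this sense.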



3 Optimal Exploration Threshold for Bandit Problems 

We want to know when it is a better decision to take action $i$ rather than some
other action $j$, with $i, j \in A$, given that we have estimates $q_i$, $q_j$ for $E[r_{t+1} \mid a_t = i]$
and $E[r_{t+1} \mid a_t = j]$ respectively¹. We shall attempt to see under which conditions
it is better to take an action different from the one whose expected reward is
greatest. For this we shall need the following assumption:

Assumption 1 (Expected rewards are bounded from below). There
exists $b \in \mathbb{R}$ such that

$$E[r_{t+1} \mid a_t = i] \geq b \quad \forall i \in A. \tag{1}$$

The above assumption is necessary for imposing a lower bound on the expected
return of exploratory actions: no matter what action is taken, we are guaranteed
that $E[r_t] \geq b$. Without this condition, exploratory actions would be too risky
to be taken at all.

Given two possible actions, one of which is currently estimated
to have a lower expected reward than the other, it might be worthwhile to
pursue the lower-valued action if the following conditions are true: (a) there is a
degree of uncertainty such that the lower-valued action can potentially be better
than the higher-valued one, (b) we are interested in maximising more than just
the expectation of the next reward, but the expectation of a weighted sum of
future rewards, (c) we will be able to accurately determine whether one action
is better than the other quickly enough, so that not a lot of resources will be
wasted in exploration.



¹ For bandit problems with states in a state space $S$, similar arguments can be made
by considering $i, j \in S \times A$.



We now start viewing $q_i$ as random variables for which we hold belief
distributions $p(q_i)$, with $\bar{q}_i \equiv E[q_i] = E[r_{t+1} \mid a_t = i]$. The problem can be defined as
deciding when taking action $i$ is better than taking action $j$, under the condition that
doing so allows us to determine whether $\bar{q}_i > q_j + \delta$ with high probability after
$T \geq 1$ exploratory actions. For this reason we will need the following bound on
the expected return of exploration.

Lemma 1 (Exploration bound). For any return of the form $R_t = \sum_{k=0}^{N} g(k)\, r_{t+k+1}$,
with $g(k) \geq 0$, assuming (1) holds, the expected return of taking action $i$ for $T$
time-steps and following a greedy policy thereafter, when $q_i > q_j$, is bounded
below by

$$\begin{aligned}
U(i, j, T, \delta, b) = {} & \sum_{k=T}^{N} g(k) \big( (q_j + \delta) P(q_i > q_j + \delta) + q_j P(q_i \leq q_j + \delta) \big) \\
& + \sum_{k=0}^{T-1} g(k) \big( (q_j + \delta) P(q_i > q_j + \delta) + b P(q_i \leq q_j + \delta) \big)
\end{aligned} \tag{2}$$

for some $\delta > 0$.

This follows immediately from Assumption 1. The greedy behaviour supposes
we are following a policy where we continue to perform $i$ if we know that
$P(q_i > q_j + \delta) \approx 1$ after $T$ steps and switch back to $j$ otherwise.
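
The bound of Lemma 1 is straightforward to evaluate numerically. The sketch below assumes that the probability $P(q_i > q_j + \delta)$ is supplied directly as `p_i` and that the weights are given as a Python callable `g`; the function name and argument names are hypothetical.

```python
def exploration_bound(q_j, delta, p_i, T, N, g, b=0.0):
    """Evaluate the lower bound U(i, j, T, delta, b) of Eq. (2).

    p_i stands for P(q_i > q_j + delta); g(k) is the reward weight at step k.
    """
    explore = sum(g(k) for k in range(0, min(T, N + 1)))  # steps k = 0..T-1
    greedy = sum(g(k) for k in range(T, N + 1))           # steps k = T..N
    win = (q_j + delta) * p_i                             # action i is better
    return (greedy * (win + q_j * (1.0 - p_i))            # fall back to j
            + explore * (win + b * (1.0 - p_i)))          # worst case b

# Undiscounted example with N = 3: compare with always taking action j.
g = lambda k: 1.0
u_explore = exploration_bound(q_j=0.5, delta=0.2, p_i=0.5, T=1, N=3, g=g)
u_greedy = sum(g(k) * 0.5 for k in range(4))
```

In this example `u_explore` exceeds `u_greedy`, so a single exploratory step would be considered worthwhile under the bound.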

Without loss of generality, in the sequel we will assume that $b = 0$ (if expected
rewards are bounded by some $b \neq 0$, we can always subtract $b$ from all rewards
and obtain the same result). For further convenience, we set $p_i \equiv P(q_i > q_j + \delta)$. Then
we may write that we must take action $i$ if the expected return of simply taking
action $j$ is smaller than the expected return of taking action $i$ for $T$ steps and
then behaving greedily, i.e. if the following holds:

$$\sum_{k=0}^{N} g(k)\, q_j < \sum_{k=T}^{N} g(k) \big( (q_j + \delta) p_i + q_j (1 - p_i) \big) + \sum_{k=0}^{T-1} g(k) (q_j + \delta) p_i \tag{3}$$

$$\sum_{k=0}^{T-1} g(k) \big( q_j - (q_j + \delta) p_i \big) < \sum_{k=T}^{N} g(k)\, \delta p_i \tag{4}$$

Let $g(k) = \gamma^k$, with $\gamma \in [0, 1]$. In this case, any choice of $T$ can be made
equivalent to $T = 1$ by dividing everything with $\sum_{k=0}^{T-1} \gamma^k$. We explore two
cases: $\gamma < 1$, $N \to \infty$ and $\gamma = 1$, $N < \infty$. In the first case, which corresponds
to infinite horizon exponentially discounted reward maximisation problems, we
obtain the following:

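$$\sum_{k=1}^{\infty} \gamma^k \delta p_i > q_j - (q_j + \delta) p_i, \tag{5}$$

which, since $\sum_{k=1}^{\infty} \gamma^k = \gamma / (1 - \gamma)$, can be solved for $\gamma$ to give

$$\gamma > \frac{q_j - (q_j + \delta) p_i}{q_j (1 - p_i)}. \tag{6}$$

It is possible to simplify this expression considerably. When $P(q_i > q_j + \delta) = 1/2$,
it follows from (6) that

$$\gamma > \frac{q_j - (q_j + \delta)/2}{q_j / 2} = \frac{q_j - \delta}{q_j}. \tag{7}$$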

Thus, for infinite horizon discounted reward maximisation problems, when it is
known that all expected rewards are non-negative, all we need to do is find
$\delta$ such that $P(q_i > q_j + \delta) = 1/2$. Then (7) can be used to make a decision on
whether it is worthwhile to perform exploration. Although it might seem strange
that $q_i$ is omitted from this expression, its value is implicitly expressed through
the value of $\delta$.

In the second case, finite horizon cumulative reward maximisation problems,
exploration should be performed when the following condition is satisfied:

$$N \delta p_i > q_j - (q_j + \delta) p_i \tag{8}$$

Here the decision making function is of a different nature, since it depends on
both estimates. However, in both cases, the longer the effective horizon
becomes and the larger the uncertainty is, the more the bias towards exploration
is increased. We furthermore note that in the finite horizon case, the backward
induction procedure can be used to make optimal decisions (see [2] Sec. 12.4).
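
Both decision rules can be checked numerically. The following sketch applies condition (4) with $T = 1$ and $b = 0$, once for $g(k) = \gamma^k$ with $N \to \infty$ and once for the undiscounted finite horizon case of (8). The function names are hypothetical and `p_i` is assumed to be given.

```python
def explore_discounted(q_j, delta, p_i, gamma):
    """Infinite horizon, g(k) = gamma**k with gamma < 1, T = 1, b = 0.

    Condition (4) becomes:
        gamma / (1 - gamma) * delta * p_i > q_j - (q_j + delta) * p_i
    """
    return gamma / (1.0 - gamma) * delta * p_i > q_j - (q_j + delta) * p_i

def explore_finite(q_j, delta, p_i, N):
    """Finite horizon, undiscounted: condition (8)."""
    return N * delta * p_i > q_j - (q_j + delta) * p_i

# With p_i = 1/2, q_j = 0.8 and delta = 0.2 the discounted threshold on
# gamma is (q_j - delta) / q_j = 0.75.
short_sighted = explore_discounted(0.8, 0.2, 0.5, gamma=0.5)
far_sighted = explore_discounted(0.8, 0.2, 0.5, gamma=0.9)
```

The same estimates lead to exploitation for a short horizon and to exploration for a long one, which is the dependence on the reward horizon that the thresholds make explicit.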



3.1 Solutions for Specific Distributions 

If we have a specific form for the distribution $P(q_i > q_j + \delta)$ it may be possible
to obtain analytical solutions. To see how this can be achieved, consider that
from (6), we have:

$$\delta \frac{p_i}{1 - p_i} > q_j (1 - \gamma),$$

that is,

$$\delta \frac{P(q_i > q_j + \delta)}{1 - P(q_i > q_j + \delta)} > q_j (1 - \gamma), \tag{9}$$

recalling that all mean rewards are non-negative.

If this condition is satisfied for some $\delta$ then exploration must be performed.
We observe that if the left-hand side is maximised for some $\delta^*$ for which the inequality
is not satisfied, then there is no $\delta \neq \delta^*$ that can satisfy it. Thus, we can attempt
to examine some distributions for which this $\delta^*$ can be determined. We shall
restrict ourselves to distributions that are bounded below, due to Assumption 1.



3.2 Solutions for the Exponential Distribution 

One such distribution is the exponential distribution, defined as 

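$$P(X > \delta) = \int_{\delta}^{\infty} \beta e^{-\beta (x - \mu)} \, dx = e^{-\beta (\delta - \mu)}$$

if $\delta > \mu$, 1 otherwise. We may plug this into (9) as follows:

$$f(\delta) = \delta \frac{P(q_i > q_j + \delta)}{1 - P(q_i > q_j + \delta)} = \delta \frac{e^{-\beta_i (q_j + \delta - \mu_i)}}{1 - e^{-\beta_i (q_j + \delta - \mu_i)}} = \frac{\delta}{e^{\beta_i (q_j + \delta - \mu_i)} - 1}.$$

Now we should attempt to find $\delta^* = \arg\max_\delta f(\delta)$. We begin by taking the
derivative with respect to $\delta$. Set $g(\delta) = e^{h(\delta)} - 1$, $h(\delta) = \beta_i (q_j + \delta - \mu_i)$:

$$\nabla f(\delta) = \frac{g(\delta) - \delta \nabla g(\delta)}{g(\delta)^2} = \frac{g(\delta) - \delta \beta_i \nabla_h g(\delta)}{g(\delta)^2} = \frac{e^{h(\delta)} (1 - \delta \beta_i) - 1}{(e^{h(\delta)} - 1)^2}.$$

Sufficient conditions for some point $\delta^*$ to be a local maximum of
a twice differentiable function $f(\delta)$ are that $\nabla_\delta f(\delta^*) = 0$ and $\nabla^2_\delta f(\delta^*) < 0$.
The first-order condition for $\delta$ results in

$$e^{\beta_i (q_j + \delta - \mu_i)} (1 - \delta \beta_i) = 1. \tag{10}$$

Unfortunately (10) has no closed form solution, but it is related to the Lambert
W function, for which iterative solutions do exist [8]. The solution found can then
be plugged into (9) to see whether the conditions for exploration are satisfied.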
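
As an alternative to Lambert W iterations, (10) can also be solved by simple bisection. The sketch below is an illustration under the assumption that $q_j > \mu_i$, in which case the left-hand side of (10) minus one is positive at $\delta = 0$, negative at $\delta = 1/\beta_i$ and strictly decreasing in between. The function names are hypothetical.

```python
import math

def delta_star(beta_i, mu_i, q_j, iterations=200):
    """Solve exp(beta_i * (q_j + d - mu_i)) * (1 - d * beta_i) = 1 for d > 0.

    Assumes q_j > mu_i so that a unique root lies in (0, 1 / beta_i).
    """
    if q_j <= mu_i:
        raise ValueError("this sketch assumes q_j > mu_i")

    def residual(d):
        return math.exp(beta_i * (q_j + d - mu_i)) * (1.0 - d * beta_i) - 1.0

    lo, hi = 0.0, 1.0 / beta_i          # residual(lo) > 0 > residual(hi)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def exploration_gain(d, beta_i, mu_i, q_j):
    """f(d) = d / (exp(beta_i * (q_j + d - mu_i)) - 1), the left side of (9)."""
    return d / math.expm1(beta_i * (q_j + d - mu_i))

d_opt = delta_star(beta_i=2.0, mu_i=0.1, q_j=0.5)
gain = exploration_gain(d_opt, 2.0, 0.1, 0.5)
```

Exploration would then be indicated when `gain` exceeds $q_j (1 - \gamma)$, following (9).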



4 Extension to the General Case 

In the general reinforcement learning setting, the reward distribution does not 
only depend on the action taken but additionally on a state variable. The state 
transition distribution is conditioned on actions and has the Markov property. 
Each particular task within this framework can be summarised as a Markov 
decision process: 

Definition 1 (Markov decision process). A Markov decision process is
defined by a set of states $S$, a set of actions $A$, a transition distribution
$\mathcal{T}(s', s, a) = P(s_{t+1} = s' \mid s_t = s, a_t = a)$ and a reward distribution
$\mathcal{R}(s', s, a) = p(r_{t+1} \mid s_{t+1} = s', s_t = s, a_t = a)$.

The simplest way to extend the bandit case to the more general one of MDPs
is to find conditions under which the latter reduces to the former. This can be
done for example by considering choices not between simple actions but between
temporally extended actions, which we will refer to as options following [9]. We
shall only need a simplified version of this framework, where each possible option
$x$ corresponds to some policy $\pi^x : S \times A \to [0, 1]$. This is sufficient for sketching
the conditions under which the equivalence arises.

In particular, we examine the case where we have two options. The first
option is to always select actions according to some exploratory principle, such
as picking them from a uniform distribution. The second is to always select actions
greedily, i.e. by picking the action with the highest expected return.

We assume that each option will last for time $T$. One further necessary
component for this framework is the notion of mixing time.



Definition 2 (Exploration mixing time). We define the exploration mixing
time for a particular MDP $\mathcal{M}$ and a policy $\pi$, $T_\epsilon(\mathcal{M}, \pi)$, as the expected number
of time steps after which the state distribution is close to the stationary state
distribution of $\pi$ after we have taken an exploratory action $i$ at time step $t$, i.e.
the expected number of steps $T$ such that the following condition holds:

$$\sum_{s \in S} \left\| P(s_{t+T} = s \mid s_t, \pi) - P(s_{t+T} = s \mid a_t = i, s_t, \pi) \right\| < \epsilon$$

It is of course necessary for the MDP to be ergodic for this to be finite. If we only
consider switching between options at time periods greater than $T_\epsilon(\mathcal{M}, \pi)$, then
the option framework roughly corresponds to the bandit framework, and $T_\epsilon$ in
the former to $T$ in the latter. This means that whenever we take an exploratory
action $i$ (one that does not correspond to the action that would have been
selected by the greedy policy $\pi$), the distribution of states would remain
significantly different from that under $\pi$ for $T_\epsilon(\mathcal{M}, \pi)$ time steps. Thus we could
consider the exploration to be taking place during all of $T_\epsilon$, after which we would
be free to continue exploration or not. Although there is no direct correspondence
between the two cases, this limited equivalence could be sufficient for motivating
the use of similar techniques for determining the optimal exploration-exploitation
threshold in full MDPs.

5 Optimistic Evaluation 

In order to utilise Lemma 1 in a practical setting we must define $T$ in some sense.
The simplest solution is to set $T = 1$, which results in an optimistic estimate for
exploratory actions as will be shown below. By rearranging (2) we have

$$U(i, j, T, \delta, b) = \sum_{k=0}^{N} g(k)\, q_j + \sum_{k=0}^{N} g(k)\, \delta p_i + (1 - p_i) \left[ \sum_{k=0}^{T-1} g(k) (b - q_j) \right] \tag{11}$$

from which it is evident, since $q_j \geq b$ and $g(k) \geq 0$, that $U(i, j, T_1, \delta, b) \geq
U(i, j, T_2, \delta, b)$ when $T_1 < T_2$, thus $U(i, j, 1, \delta, b) \geq U(i, j, T, \delta, b)$ for any $T \geq 1$.
This can now be used to obtain Alg. 1 for optimistic exploration.

Nevertheless, testing for the existence of a suitable $\delta$ can be costly since,
barring an analytic procedure, it requires an exhaustive search. On the other
hand, it may be possible to achieve a similar result through sampling for different
values of $\delta$. Herein, the following sampling method is considered: firstly, we
determine the action $j$ with the greatest $q_j$. Then, for each action $i$ we take a
sample $x$ from the distribution $p(q_i)$ and set $\delta = x - \bar{q}_j$. This is quite an arbitrary
sampling method, but we may expect to obtain a $\delta > 0$ with high probability
if $i$ has a high probability to be significantly better than $j$. This method is
summarised in Alg. 2.

An alternative exploration method is given by Alg. 3, which samples each
action with probability equal to the probability that its expected reward is the
highest. It can perhaps be viewed as a crude approximation to Alg. 2 when $\gamma \to 1$
and has the advantage that it is extremely simple.



Algorithm 1 Optimistic exploration

if $\exists\, \delta : U(i, j, 1, \delta, b) > \sum_{k=0}^{N} g(k)\, q_j$ then
    $a \leftarrow i$
else
    $a \leftarrow j$
end if



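Algorithm 2 Optimistic stochastic exploration

$j \leftarrow \arg\max_i q_i$
$u_j \leftarrow \sum_{k=0}^{N} g(k)\, q_j$
for all $i \neq j$ do
    $\delta \leftarrow x - \bar{q}_j$, $x \sim p(q_i)$
    $u_i \leftarrow U(i, j, 1, \delta, b)$
end for
$a \leftarrow \arg\max_i u_i$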
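
A possible Python rendering of Alg. 2 is sketched below for $g(k) = \gamma^k$ and $N \to \infty$. Several details are assumptions made for illustration only: each belief $p(q_i)$ is represented by a list of point estimates, $q_j$ is taken to be the population mean, and $P(q_i > q_j + \delta)$ is estimated as the fraction of population members exceeding the sampled value $x$ (which equals $q_j + \delta$ by construction). The function name is hypothetical.

```python
import random

def optimistic_stochastic_action(populations, gamma, b=0.0, rng=random):
    """One action choice in the spirit of Alg. 2, with g(k) = gamma**k.

    populations[i] is a list of point estimates standing in for p(q_i).
    """
    means = [sum(pop) / len(pop) for pop in populations]
    j = max(range(len(means)), key=lambda i: means[i])   # greedy action
    q_j = means[j]
    tail = gamma / (1.0 - gamma)        # sum of gamma**k for k >= 1
    u = [0.0] * len(populations)
    u[j] = q_j * (1.0 + tail)           # return of always taking action j
    for i, pop in enumerate(populations):
        if i == j:
            continue
        x = rng.choice(pop)             # x ~ p(q_i)
        delta = x - q_j
        # Empirical estimate of P(q_i > q_j + delta); q_j + delta equals x.
        p_i = sum(1 for v in pop if v > x) / len(pop)
        win = (q_j + delta) * p_i
        # U(i, j, 1, delta, b): one exploratory step, greedy afterwards.
        u[i] = tail * (win + q_j * (1.0 - p_i)) + win + b * (1.0 - p_i)
    return max(range(len(u)), key=lambda i: u[i])
```

With this representation an uncertain action is only selected when the sampled $\delta$ is positive and the discount factor is large enough for the later gain to outweigh the immediate risk.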



6 Experiments 

A small experiment was performed on an n-armed bandit problem with rewards
$r_t \in \{0, 1\}$ drawn from a Bernoulli distribution. Alg. 2 was used with $g(k) = \gamma^k$
and $b = 0$, which is in agreement with the distribution. This was compared with
Alg. 3, which can perhaps be viewed as a crude approximation to Alg. 2 when
$\gamma \to 1$. The performance of $\epsilon$-greedy action selection with $\epsilon = 0.01$ was evaluated
for reference.

The $\epsilon$-greedy algorithm used point estimates for $q_i$, which were updated with
gradient descent with a step size of $\alpha = 0.01$, such that for each action-reward
observation tuple $(a_t = i, r_{t+1})$, $q_i \leftarrow q_i + \alpha (r_{t+1} - q_i)$, with initial estimates being
uniformly distributed in $[0, 1]$. In the other two cases, the complete distribution
of $q_i$ was maintained via a population $\{q_i^k\}_{k=1}^{K}$ of point estimates, with $K = 16$.
Each point estimate in the population was maintained in the same manner as the
single point estimates in the $\epsilon$-greedy approach. Sampling actions was performed
by sampling uniformly from the members of the population for each action.

The results for two different bandit tasks, one with 16 and the other with
128 arms, averaged over 1,000 runs, are summarised in Fig. 1. For each run, the
expected reward of each bandit was sampled uniformly from $[0, 1]$. As can be seen
from the figure, the $\epsilon$-greedy approach performs relatively well when used with
reasonable initial estimates. The sampling-greedy approach, while having the
same complexity, appears to perform better asymptotically. More importantly,
Alg. 2 exhibits better long-term versus short-term performance when the effective
reward horizon is increased as $\gamma \to 1$.



Algorithm 3 Sampling-greedy

$a \leftarrow i$ with probability $P(a = i) = P(q_i > q_j \;\; \forall j \neq i)$
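
One simple way to realise Alg. 3 is to draw a single sample from each belief and take the action whose sample is largest, which selects each action with the probability that its sample is the highest. The sketch below assumes, as an illustration, that each belief is represented by a list of point estimates; the function name is hypothetical.

```python
import random

def sampling_greedy_action(populations, rng=random):
    """Draw one value per action from its population and act greedily on them.

    populations[i] is a list of point estimates standing in for p(q_i).
    """
    draws = [rng.choice(pop) for pop in populations]
    return max(range(len(draws)), key=lambda i: draws[i])

# Action 1 dominates action 0, so it is always selected here.
chosen = sampling_greedy_action([[0.1, 0.2], [0.5, 0.6]])
```

When the populations overlap, both actions are selected some of the time, with frequencies that track how often each one produces the highest draw.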



Fig. 1. Average reward in a multi-armed bandit task averaged over 1,000 experiments,
smoothed with a moving average over 10 time-steps. Results are shown for $\epsilon$-greedy
(e-greedy), sampling-greedy (sampling) and Alg. 2 (opt) with $\gamma \in \{0.5, 0.9, 0.99\}$.



7 Discussion and Conclusion 



This paper has presented a formulation of an optimal exploration-exploitation
threshold in an n-armed bandit task, which links the need for exploration to
the effective reward horizon and model uncertainty. Additionally, a practical
algorithm, based on an optimistic bound on the value of exploration, is introduced.
Experimental results show that this algorithm exhibits the expected long-term
versus short-term performance trade-off when the effective reward horizon is
increased.

While the above formulation fits well within a reinforcement learning
framework, other useful formulations may exist. In budgeted learning, any exploratory
action results in a fixed cost. Such a formulation is used in [10] for the bandit
problem (see also [11] for the active learning case). Then the problem essentially
becomes that of how to best sample from actions in the next $T$ moves such that
the expected return of the optimal policy after $T$ moves is maximised, and
corresponds to $g(k) = 0$ $\forall k < T$ in the framework presented in this paper. A further
alternative, described in [6], is to stop exploring those parts of the state-action
space which lead to sub-optimal returns with high probability.

When a distribution or a confidence interval is available for expected returns,
it is common to use the optimistic side of the confidence interval for action
selection [12]. This practice can be partially justified through the framework presented
herein, or alternatively, through considering maximising the expected
information to be gained by exploration, as proposed by [13]. In a similar manner, other
methods which represent uncertainty as a simple additive factor to the
normal expected reward estimates acquire further meaning when viewed through
a statistical decision making framework. For example the Dyna-Q+ algorithm
(see [7] chap. 9) includes a slowly increasing exploration bonus for state-action
pairs which have not been recently explored. From a statistical viewpoint, the
exploration bonus corresponds to a model of a non-stationary world, where
uncertainty about past experiences increases with elapsed time.

In general, the conditions defined in Sec. 3 require maintaining some type
of belief distribution over the expected return of actions. A natural choice for
this would be to use a fully analytical Bayesian framework. Unfortunately this
makes it more difficult to calculate $P(q_i > \delta)$, so it might be better to consider
simple numerical approaches from the outset. We have previously considered
some such simple estimates in [14], where we relied on estimating the gradient of
the expected return with respect to the parameters. The estimated gradient was
then used as a measure of uncertainty. Further research on the use of
population-based methods for explicitly representing a distribution of estimates is currently
under way.